Contents [hide]

Feature Analysis for Time Series

Authors: Isadora Nun, Pavlos Protopapas

Contributors: Karim Pichara, Daniel Acuña, Nicolás Castro, Cristobal Mackenzie, Andrés Riveros and Ming Zhu

Introduction

A time series is a sequence of observations (data points) that are arranged based on the time of their ocurrence. The hourly measurement of wind speeds in meteorology, the minute by minute recording of electrical activity along the scalp in electroencephalography, the weekly changes of stock prices in finances, are just some examples among many others.

Some of the main properties one would expect to find in a time series are [1]:

  • the data is not generated independently
  • their dispersion varies in time
  • they are often governed by a trend and they have cyclic components

The study and analysis of time series can have multiple ends: get a better understanding of the mechanism generating the data, predict future outcomes and behaviours, classification and characterization of events, etc.

In [325]:
animation.FuncAnimation(fig, animate, init_func=init,
                        frames=100, interval=200, blit=True)
Out[325]:


Once Loop Reflect

In time-domain astronomy, data gathered from the telescopes is usually represented in the form of light-curves. These are time series that show the brightness variation of an object through a period of time (for a visual representation see video below). Based on the variability characteristics of the light-curves, stars can be classified into different groups (quasars, long period variables, eclipsing binaries, etc.) and consequently be studied in depth independentely.

In order to characterize this variability, most of the existing methods use machine learning algorithms that build their decision on the light-curves features. Features, the topic of the following work, are numerical descriptors that aim to characterize and distinguish the different variability classes. They can go from basic statistical measures such as the mean or the standard deviation, to complex time-series characteristics such as the autocorrelation function.

In this document we present a library with a compilation of some of the existing light-curve features. The main goal is to create a collaborative and open tool where every user can characterize or analyze an astronomical photometric database while also contributing to the library by adding new features. However, it is important to highlight that this library is not restricted to the astronomical field and could also be applied to any kind of time series.

Our vision is to be capable of analyzing and comparing light-curves from all the available astronomical databases in a standard and universal way. This would facilitate and make more efficient tasks as modelling, classification, data cleaning, outlier detection and data analysis in general. Consequently, when studying light-curves, astronomers and data analysts would be on the same wavelength and would not have the necessity to find a way of comparing or matching different features. In order to achieve this goal, the library should be run in every existent survey (MACHO, EROS, Ogle, LSST, etc) and the results should be ideally shared in the same open way as this library. We invite every user having a survey database to use our library and to contribute with the results in favour of our goal of creating a universal astronomical survey.

Video 1: Light-Curve of Triple Star

The video below shows how the data gathered from the brightness intensity of a star through time results on a light-curve. In this particular case we are observing a complex triple system in which three stars have mutual eclipses as each of the stars gets behind or in front of the others.

In [24]:
from IPython.display import YouTubeVideo
YouTubeVideo('qMx4ozpSRuE',  width=750, height=360, align='right')
Out[24]:

Figure 1: Some variability classes

The following picture [5] presents example light-curves of each class in the MACHO survey. The x-axis is the modified Julian Date (MJD), and the y-axis is the MACHO B-magnitude.

In [25]:
picture = Image(filename='curvas_ejemplos11.jpg')
picture.size = (100, 100)
picture
Out[25]:

The library

Library structure

The library is coded in python and can be downloaded for free from the github website https://github.com/isadoranun/tsfeat. The main idea is that any user can run it in its own database but can also add new features through the github system. For a quick guide on how to use github visit https://guides.github.com/activities/hello-world/.

The library receives as minimum input the light-curve magnitude (light intensity) through time and returns as output a vector with the calculated features. The structure is divided into two main parts. Part one, Feature.py, is a wrapper class that allows the user to select the category of the features to be calculated or to specify a list of them. Part two, FeatureFunciontLib.py, contains the actual code for calculating the features. Each feature has its own class with at least two functions:

init: receives the necessary inputs (other than the magnitude) for the feature calculation (ex: observational error, observational time, magnitude in a different color band, etc.). It also defines the feature category, "times-series" or "basic".

fit: returns the calculated feature. The output can only take one value; features like the autocorrelation function must consequently be summarized in one single scalar.

The following code is an example of a class in FeatureFunciontLib.py that calculates the color feature (difference of the mean magnitude in two different bands) :

class B_R(Base):

   #Lightcurve average color  
   #mean(B) - mean(R)

    def __init__(self, second_data):
    self.category='timeSeries'
    if second_data is None:
        print "please provide another data series to compute B_R"
        sys.exit(1)
    self.data2 = second_data

    def fit(self, data):
    return np.mean(data) - np.mean(self.data2)

If the user wants to contribute with feature(s) to the library, it (they) must be added to FeatureFunctionLib.py following the explained format. There is no need to modify Feature.py.

Choosing the features

The library we created allows the user to either choose the specific features of interest to be calculated or to calculate them all simultaneously. Also, the features can be calculated based on their category: "basic" or "time series".

By default the library receives only the magnitude data as input. As explained above, some features need extra data as the times of measurement or the associated error, this must be specified as a parameter of the feature.

The list of all the possible features with their corresponding categories, parameters and literature source is presented in the following table:

In [240]:
make_table(FeaturesList)
apply_theme('basic')
set_global_style(float_format='%0.3E')
Out[240]:
FeatureCategoryParameter(s)Reference
[Amplitude](#Bmean)basicRichards et al., 2011
Anderson-Darling testtimeSeriesKim et al., 2008
Automeanbasic2 parameters?
AutocortimeSeriesKim et al., 2011
B_RtimeSeriessecond_dataKim et al., 2011
Beyond1StdLbasicerrorRichards et al., 2011
BmeanbasicKim et al., 2014
CAR_sigmatimeSeriesmjd, errorPichara et al., 2012
CAR_meantimeSeriesPichara et al., 2012
CAR_tautimeSeriesPichara et al., 2012
ContimeSeriesKim et al., 2011
Eta_B_RtimeSeriesaligned_second_data, aligned_data, aligned_mjdKim et al., 2014
Eta_etimeSeriesmjdKim et al., 2014
FluxPercentileRatioMid20basicRichards et al., 2011
FluxPercentileRatioMid35basicRichards et al., 2011
FluxPercentileRatioMid50basicRichards et al., 2011
FluxPercentileRatioMid65basicRichards et al., 2011
FluxPercentileRatioMid80basicRichards et al., 2011
LinearTrendtimeSeriesmjdRichards et al., 2011
MaxSlopetimeSeriesmjdRichards et al., 2011
MeanvariancebasicKim et al., 2011
MedianAbsDevbasicRichards et al., 2011
MedianBRPbasicRichards et al., 2011
PairSlopeTrendtimeSeriesRichards et al., 2011
PercentAmplitudebasicRichards et al., 2011
PercentDifferenceFluxPercentilebasicRichards et al., 2011
PeriodLStimeSeriesmjdKim et al., 2011
Period_fittimeSeries Kim et al., 2011
Psi_CStimeSeriesmjdKim et al., 2014
Psi_etatimeSeriesKim et al., 2014
Q31basicKim et al., 2014
Q31B_RtimeSeriesaligned_second_data, aligned_dataKim et al., 2014
RcstimeSeriesKim et al., 2011
SkewbasicRichards et al., 2011
SlottedAtimeSeriesmjd, TScott et al., ?
SmallKurtosisbasicRichards et al., 2011
StdbasicRichards et al., 2011
StetsonJtimeSeriesaligned_second_data, aligned_dataRichards et al., 2011
StetsonKtimeSeriesRichards et al., 2011
StetsonK_ACtimeSeriesKim et al., 2011
StetsonLtimeSeriesaligned_second_data, aligned_dataKim et al., 2011
VariablityIndextimeSeriesKim et al., 2011

Some examples of how to use the library are presented next:

- List of features as an input:

In [29]:
a = FeatureSpace(featureList=['Std','StetsonL'], StetsonL=[aligned_second_data, aligned_data])
a=a.calculateFeature(data)

Table(a)
Out[29]:
FeatureValue
Std0.14157
StetsonL0.32184

- Category as an input:

In [30]:
a = FeatureSpace(category=['timeSeries'], Automean=[0,0], StetsonL=[aligned_second_data, aligned_data] ,  B_R=second_data, Beyond1Std=error, StetsonJ=[aligned_second_data, aligned_data], MaxSlope=mjd, LinearTrend=mjd, Eta_B_R=[aligned_second_data, aligned_data, aligned_mjd], Eta_e=mjd, Q31B_R=[aligned_second_data, aligned_data], PeriodLS=mjd, Psi_CS=mjd, CAR_sigma = [mjd, error**2], SlottedA = [mjd, 4])
a=a.calculateFeature(data)

Table(a)
Out[30]:
FeatureValue
AndersonDarling1.00000
Autocor1.00000
B_R-0.33326
CAR_sigma1.53866
CAR_tau0.01678
CAR_tmean-352.77218
Con0.00000
Eta_B_R12930.68526
Eta_e905.63620
LinearTrend0.00001
MaxSlope54.72526
PairSlopeTrend0.03333
PeriodLS0.93697
Period_fit0.00000
Psi_CS0.19179
Psi_eta0.58910
Q31B_R0.10600
Rcs0.03917
SlottedA4.00000
StetsonJ0.55385
StetsonK0.76094
StetsonK_AC0.86733
StetsonL0.32184

- All the features:

In [31]:
a = FeatureSpace(category='all',featureList=None, Automean=[0,0], StetsonL=[aligned_second_data, aligned_data] ,  B_R=second_data, Beyond1Std=error, StetsonJ=[aligned_second_data, aligned_data], MaxSlope=mjd, LinearTrend=mjd, Eta_B_R=[aligned_second_data, aligned_data, aligned_mjd], Eta_e=mjd, Q31B_R=[aligned_second_data, aligned_data], PeriodLS=mjd, Psi_CS=mjd, CAR_sigma=[mjd, error], SlottedA = [mjd, 4])
a=a.calculateFeature(data)

Table(a)
Out[31]:
FeatureValue
Amplitude0.26600
AndersonDarling1.00000
Autocor1.00000
Automean-5.91799
B_R-0.33326
Beyond1Std0.22278
Bmean-5.91799
CAR_sigma-0.21928
CAR_tau0.64112
CAR_tmean-9.23070
Con0.00000
Eta_B_R12930.68526
Eta_e905.63620
FluxPercentileRatioMid200.08929
FluxPercentileRatioMid350.17857
FluxPercentileRatioMid500.31473
FluxPercentileRatioMid650.52232
FluxPercentileRatioMid800.80134
LinearTrend0.00001
MaxSlope54.72526
Meanvariance-0.02392
MedianAbsDev0.05450
MedianBRP0.74539
PairSlopeTrend0.03333
PercentAmplitude-0.11309
PercentDifferenceFluxPercentile-0.07511
PeriodLS0.93697
Period_fit0.00000
Psi_CS0.19179
Psi_eta0.58910
Q310.14100
Q31B_R0.10600
Rcs0.03917
Skew0.95647
SlottedA4.00000
SmallKurtosis1.37948
Std0.14157
StetsonJ0.55385
StetsonK0.76094
StetsonK_AC0.86733
StetsonL0.32184

Library input: the light-curves

In order to calculate the features, it is first necessary to import the data in the right format. A light-curve should be an array composed by at least four vectors: magnitude in two different bands, time of measurement and the associated observational error. For example, the function LeerLC_MACHO() receives a MACHO id (object id assigned in the MACHO survey) as an input and returns the following output:

  • data: magnitude measurement (blue band)
  • mjd: time of measurement (blue band)
  • error: associated observational error (blue band)
  • second_data: magnitude measurement (red band)

A demostration of how to import a MACHO light-curve is presented below. Besides opening the file, the data is:

  • preprocessed: points within 5 standard deviations from the mean are considered as noice and thus are eliminated.
  • synchronized: when observed in different bands, light-curves of a same object are not always monitored for the same time length and at the same precise times. For some features, it is important to align the light-curves and to only consider the simultaneous measurements from both bands.
In [32]:
#We open the ligth curve in two different bands
lc_B = LeerLC_MACHO('lc_1.3444.614.B.mjd')  #58.6272.729 1.3444.614 1.4652.1527
lc_R = LeerLC_MACHO('lc_1.3444.614.R.mjd')

#We import the data
[data, mjd, error] = lc_B.leerLC()
[data2, mjd2, error2] = lc_R.leerLC()

#We preprocess the data
preproccesed_data = Preprocess_LC(data, mjd, error)
[data, mjd, error] = preproccesed_data.Preprocess()

preproccesed_data = Preprocess_LC(data2, mjd2, error2)
[second_data, mjd2, error2] = preproccesed_data.Preprocess()

#We synchronize the data
if len(data) != len(second_data):
    [aligned_data, aligned_second_data, aligned_mjd] = Align_LC(mjd, mjd2, data, second_data, error, error2)

It is sometimes helpful to visualize the data before processing it. For a representation of the light curve, we can plot it as follows:

In [33]:
Color = [ 1 ,0.498039, 0.313725];
p = plt.plot(mjd, data, '*-', color=Color)
plt.xlabel("MJD")
plt.ylabel("Magnitude")
plt.gca().invert_yaxis()
print np.std(data)/np.mean(data)
-0.0239225135894

Note: for periodic light-curves we are able to transform the photometric time series into a single light-curve in which each period is mapped onto the same time axis as follows: $$ t'=\{\frac{t-t_0}{T}\} $$

where $T$ is the period, $t_0$ is an arbitrary starting point and the symbol {} represents the non-integer part of the fraction. This process produces a folded light-curve on an x-axis of folded time that ranges from 0 to 1. The corresponding folded light-curve of the previous example is shown next:

In [34]:
Color = [ 0.392157, 0.584314 ,0.929412];
T = 2 * 0.93697446
new_b=np.mod(mjd, T) / T;
idx=np.argsort(2*new_b)
plt.plot( new_b, data, '*', color = Color)
plt.xlabel("Phase")
plt.ylabel("Magnitude")
plt.gca().invert_yaxis()

Library output

When calculating the features of a light-curve, the output can be returned in three different formats:

  • "dict": returns a vector with the name of each feature followed by its value.
  • "array": returns a vector with only the features values.
  • "features" : returns a vector with the list of the calculated features.

The following example shows how to implement the first option:

In [35]:
a.result(method='dict')
Out[35]:
{'Amplitude': 0.26600000000000001,
 'AndersonDarling': 1.0,
 'Autocor': 1,
 'Automean': -5.9179891122278052,
 'B_R': -0.33325502453332145,
 'Beyond1Std': 0.22278056951423786,
 'Bmean': -5.9179891122278052,
 'CAR_sigma': -0.21928049298842511,
 'CAR_tau': 0.64112037377348619,
 'CAR_tmean': -9.230698873903961,
 'Con': 0.0,
 'Eta_B_R': 12930.685257570141,
 'Eta_e': 905.63620081228805,
 'FluxPercentileRatioMid20': 0.089285714285714468,
 'FluxPercentileRatioMid35': 0.17857142857142894,
 'FluxPercentileRatioMid50': 0.31473214285714324,
 'FluxPercentileRatioMid65': 0.52232142857142916,
 'FluxPercentileRatioMid80': 0.80133928571428659,
 'LinearTrend': 6.1736585768121618e-06,
 'MaxSlope': 54.725258361167832,
 'Meanvariance': -0.023922513589418142,
 'MedianAbsDev': 0.054499999999999993,
 'MedianBRP': 0.7453936348408711,
 'PairSlopeTrend': 0.03333333333333333,
 'PercentAmplitude': -0.11308575739793782,
 'PercentDifferenceFluxPercentile': -0.075111073853633914,
 'PeriodLS': 0.93697445905023935,
 'Period_fit': 3.1076457440339972e-113,
 'Psi_CS': 0.19179114069585729,
 'Psi_eta': 0.58910268294132195,
 'Q31': 0.14100000000000001,
 'Q31B_R': 0.10600000000000076,
 'Rcs': 0.039171450772665782,
 'Skew': 0.956469867559379,
 'SlottedA': 4,
 'SmallKurtosis': 1.3794786801255068,
 'Std': 0.14157317495929828,
 'StetsonJ': 0.5538531876931454,
 'StetsonK': 0.76093506917809894,
 'StetsonK_AC': 0.8673297341638867,
 'StetsonL': 0.32183544397369113}

The features

The next section details the features that we have developed in order to represent the light-curves. Every feature is described and tested with a suitable experiment to prove its validity.

Note: Each feature was also tested to check its invariance to unequal sampling. This refers to the fact that the telescope observations are not always taken at uniformly spaced intervals. Evidently, light-curve descriptors should be insensitive to this nonuniformity. The tests can be found in the following link.

Bmean

Mean magnitude of the blue band. For a normal distribution it should take a value close to 0:

In [36]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['Bmean'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Bmean': -0.015237803017348403}

Standard deviaton

The standard deviation $\sigma$ of the sample is defined as:

$$\sigma=\sum_{i} \frac{(y_{i}-\hat{y})^2}{N-1}$$

In [37]:
a = FeatureSpace(featureList=['Std'])
a=a.calculateFeature(data)
print a.result(method='dict')
{'Std': 0.14157317495929828}

For example, a white noise time serie should have $\sigma=1$

In [38]:
data2 = np.random.normal(size=1000000)
a = FeatureSpace(featureList=['Std' ] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Std': 0.99916292602071488}

Range of a cumulative sum $R_{cs}$

$R_{cs}$ is the range of a cumulative sum (Ellaway 1978) of each light-curve and is defined as:

$$R_{cs} = max(S) - min(S)$$ $$S_l = \frac{1}{N \sigma} \sum_{i=1}^l \left( m_i - \bar{m} \right) $$

where max(min) is the maximum (minimum) value of S and $l=1,2, \dots, N$.

$R_{cs}$ should take a value close to zero for a normal distribution:

In [39]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['Rcs'] )
a=a.calculateFeature(data)
print a.result(method='dict')
{'Rcs': 0.039171450772665782}

Period Lomb-Scargle

The Lomb-Scargle (L-S) algorithm (Scargle, 1982) is a variation of the Discrete Fourier Transform (DFT), in which a time series is decomposed into a linear combination of sinusoidal functions. The basis of sinusoidal functions transforms the data from the time domain to the frequency domain. DFT techniques often assume evenly spaced data points in the time series, but this is rarely the case with astrophysical time-series data. Scargle has derived a formula for transform coefficients that is similiar to the DFT in the limit of evenly spaced observations. In addition, an adjustment of the values used to calculate the transform coefficients makes the transform invariant to time shifts.

The Lomb-Scargle periodogram is optimized to identify sinusoidal-shaped periodic signals in time-series data. Particular applications include radial velocity data and searches for pulsating variable stars. L-S is not optimal for detecting signals from transiting exoplanets, where the shape of the periodic light-curve is not sinusoidal.

Next, we perform a test on synthetical data to confirm the accuracy of the period found by the L-S method:

In [40]:
Color = [ 1 ,0.498039, 0.313725];
N=100
mjd3 = np.arange(N)
Period = 20
cov = np.zeros([N,N])
mean = np.zeros(N)
for i in np.arange(N):
    for j in np.arange(N):
        cov[i,j] = np.exp( -(np.sin( (np.pi/Period) *(i-j))**2))
data3=np.random.multivariate_normal(mean, cov)
plt.plot(mjd3,data3, color=Color)
-c:10: RuntimeWarning: covariance is not positive-semidefinite.
Out[40]:
[<matplotlib.lines.Line2D at 0x10590b610>]
In [41]:
a = FeatureSpace(featureList=['PeriodLS'],PeriodLS=mjd3)
a=a.calculateFeature(data3)
print a.result(method='array')/Period, a.result(method='array'), Period
[ 0.99] [ 19.8] 20

We can also confirm the validity of this result by folding the light-curve as explained in the introduction.

In [42]:
Color = [ 0.392157, 0.584314 ,0.929412];
T = 2 * a.result(method='array')
new_b=np.mod(mjd3, T) / T;
idx=np.argsort(2*new_b)
plt.plot( new_b, data3,'*', color=Color)
#plt.plot(np.sort(new_b), data3[np.argsort(new_b)],'*', color='red')
Out[42]:
[<matplotlib.lines.Line2D at 0x10b41ec10>]

Period fit

Returns the false alarm probability of the largest periodogram value. Let's test it for a normal distributed data and for a periodic one:

In [43]:
data2 = np.random.normal(size=1000)
mjd2=np.arange(1000)
a = FeatureSpace(featureList=['PeriodLS','Period_fit'], PeriodLS=mjd2)
a=a.calculateFeature(data2)
print "Normal data:", a.result(method='dict')

a = FeatureSpace(featureList=['PeriodLS','Period_fit'], PeriodLS=mjd3)
a=a.calculateFeature(data3)
print "Periodic data:", a.result(method='dict')
Normal data: {'Period_fit': 1.0, 'PeriodLS': 9.7622149837133545}
Periodic data: {'Period_fit': 5.665716272086323e-14, 'PeriodLS': 19.800000000000001}

$\Psi_{CS}$

$R_{CS}$ applied to the phase-folded light curve (generated using the period estimated from the Lomb-Scargle method).

In [44]:
a = FeatureSpace(featureList=['PeriodLS','Psi_CS'], PeriodLS=mjd3, Psi_CS =mjd3 )
a=a.calculateFeature(data3)
print a.result(method='dict')
{'PeriodLS': 19.800000000000001, 'Psi_CS': 0.25705852694656101}

Color B_R

The color is defined as the difference between the average magnitude of the blue band and the red band observations. The value should be around zero.

In [45]:
a = FeatureSpace(featureList=['B_R' ],B_R=second_data )
a=a.calculateFeature(data)
print a.result(method='dict')
{'B_R': -0.33325502453332145}

Autocorrelation

The autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself. Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies.

For an observed series $y_1, y_2,\dots,y_T$ with sample mean $\bar{y}$, the sample lag$-h$ autocorrelation is given by:

$$\hat{\rho}_h = \frac{\sum_{t=h+1}^T (y_t - \bar{y})(y_{t-h}-\bar{y})}{\sum_{t=1}^T (y_t - \bar{y})^2}$$

Since the autocorrelation fuction of a light curve is given by a vector and we can only return one value as a feature, we find the lag value where the autocorrelation function is smaller than $e^{-1}$.

In [46]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['Autocor'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Autocor': 1}

Slotted Autocorrelation

In slotted autocorrelation, time lags are defined as intervals or slots instead of single values. The slotted autocorrelation function at a certain time lag slot is computed by averaging the cross product between samples whose time differences fall in the given slot.

$$\hat{\rho}(\tau=kh) = \frac {1}{\hat{\rho}(0)\,N_\tau}\sum_{t_i}\sum_{t_j= t_i+(k-1/2)h }^{t_i+(k+1/2)h } \bar{y}_i(t_i)\,\, \bar{y}_j(t_j) $$

In order to check the validity of this feature let's calculate the slotted autocorrelation for a normal distribution with T=1 and compare it with the autocorrelation function.

In [85]:
data2 = np.random.normal(size=1000)
mjd2=np.arange(1000)
Color = [ 0.392157, 0.584314 ,0.929412];
Color2 = [ 1 ,0.498039, 0.313725];
plt.figure(figsize=(10,5))
SAC = slotted_autocorrelation(data2, mjd2, 1, 100)
a, = plt.plot(SAC[0:40], '*', color=Color, markersize=10)
AC = stattools.acf(data2)
b, =plt.plot(AC, color=Color2)
plt.legend([a, b],['Slotted autocorrelation', 'Autocorrelation'])
Out[85]:
<matplotlib.legend.Legend at 0x10a57a650>
In [49]:
data2 = np.random.normal(size=1000)
mjd2=np.arange(1000)

AC = stattools.acf(data2)
k = next((index for index,value in enumerate(AC) if value < np.exp(-1)), None)
print "From the autocorrealtion function:", k

a = FeatureSpace(featureList=['SlottedA'] , SlottedA = [mjd2, 1])
a=a.calculateFeature(data2)
print a.result(method='dict')
From the autocorrealtion function: 1
{'SlottedA': 1}

Stetson K, Stetson K_AC, Stetson J and Stetson L

These three features are based on the Welch/Stetson variability index $I$ defined by the equation: $$ I = \sqrt{\frac{1}{n(n-1)}} \sum_{i=1}^n {\left( \frac{b_i-\hat{b}}{\sigma_{b,i}}\right) \left( \frac{v_i - \hat{v}}{\sigma_{v,i}} \right)} $$

where $b_i$ and $v_i$ are the apparent magnitudes obtained for the candidate star in two observations closely spaced in time on some occasion $i$, $\sigma_{b,i}$ and $\sigma_{v,i}$ are the standard errors of those magnitudes, $\hat{b}$ and $\hat{v}$ are the weighted mean magnitudes in the two filters, and $n$ is the number of observation pairs.

Since a given frame pair may include data from two filters which did not have equal numbers of observations overall, the "relative error" is calculated as follows:

$$ \delta = \sqrt{\frac{n}{n-1}} \frac{v-\hat{v}}{\sigma_v} $$

allowing all residuals to be compared on an equal basis.

- Stetson K

Stetson K is a robust kurtosis measure: $$ \frac{1/N \sum_{i=1}^N |\delta_i|}{\sqrt{1/N \sum_{i=1}^N \delta_i^2}}$$

where the index $i$ runs over all $N$ observations available for the star without regard to pairing. For a Gaussian magnitude distribution K should take a value close to $\sqrt{2/\pi} = 0.798$, let's test it:

In [50]:
data2 = np.random.normal(size=1000)
a = FeatureSpace(featureList=['StetsonK' ] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'StetsonK': 0.79277008314449438}
- Stetson K_AC

Stetson K applied to the slotted autocorrelation function of the light-curve.

In [51]:
mjd2=np.arange(1000)
a = FeatureSpace(featureList=['StetsonK_AC' ])
a=a.calculateFeature(data2)
print a.result(method='dict')
{'StetsonK_AC': 0.33635380258759784}
- Stetson J

Stetson J is a robust version of the variability index. It is calculated based on two simultaneous light curves of a same star and is defined as:

$$ J = \sum_{k=1}^n sgn(P_k) \sqrt{|P_k|}$$

with $P_k = \delta_{i_k} \delta_{j_k} $

For a Gaussian magnitude distribution, J should take a value close to zero:

In [52]:
data2 = np.random.normal(size=10000)
data3 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['StetsonJ' ], StetsonJ=[data3, data2])
a=a.calculateFeature(data2)
print a.result(method='dict')
{'StetsonJ': -0.020124784778673604}
- Stetson L

Stetson L variability index describes the synchronous variability of different bands and is defined as: $$ L = \frac{JK}{0.798} $$

Again, for a Gaussian magnitude distribution, L should take a value close to zero:

In [53]:
data2 = np.random.normal(size=10000)
data3 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['StetsonL' ], StetsonL=[data3, data2])
a=a.calculateFeature(data2)
print a.result(method='dict')
{'StetsonL': -0.0074019683024082903}

Variability index $\eta$ REMOVED FROM THE LIBRARY

Variability index $\eta$ is the ratio of the mean of the square of successive differences to the variance of data points. The index was originally proposed to check whether the successive data points are independent or not. In other words, the index was developed to check if any trends exist in the data (von Neumann 1941). It is defined as: $$\eta=\frac{1}{\left(N-1 \right)\sigma^2}\sum_{i=1}^{N-1} \left( m_{i+1}-m_i \right)^2 $$

The variability index should take a value close to 2 for a normal distribution:

In [54]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['VariabilityIndex' ] )
a=a.calculateFeature(data2)
print a.result(method='dict')
 could not find feature VariabilityIndex
An exception has occurred, use %tb to see the full traceback.

SystemExit: 1
To exit: use 'exit', 'quit', or Ctrl-D.

Variability index $\eta^e$

Although $\eta$ is a poweful index for quantifying variability characteristics of a time series, it does not take into account unequal sampling. Thus $\eta^e$ is defined as:

$$ \eta^e = \bar{w} \left( t_{N-1} - t_1 \right)^2 \frac{\sum_{i=1}^{N-1} w_i \left(m_{i+1} - m_i \right)^2}{\sigma^2 \sum_{i=1}^{N-1} w_i} $$

$$ w_i = \frac{1}{\left( t_{i+1} - t_i \right)^2} $$

In [55]:
data2 = np.random.normal(size=10000)
mjd2=np.arange(10000)
a = FeatureSpace(featureList=['Eta_e' ], Eta_e = mjd2 )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Eta_e': 2.0145774325220933}

Variability index $\eta_{B-R}$

$\eta^e$ index calculated from the B − R light curve.

In [56]:
data3 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['Eta_B_R' ], Eta_B_R = [data3, data2, mjd2] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Eta_B_R': 1.9874162350493008}

$\Psi_\eta$

$\eta$ index calculated from the folded light curve.

In [57]:
data2 = np.random.normal(size=1000)
mjd2=np.arange(1000)
a = FeatureSpace(featureList=['PeriodLS','Psi_eta'], PeriodLS = mjd2)
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Psi_eta': 2.0586782806970145, 'PeriodLS': 10.388214904679375}

Small Kurtosis

Small sample kurtosis of the magnitudes: $$ Kurtosis = \frac{N \left( N+1 \right)}{\left( N-1 \right) \left( N-2 \right) \left( N-3 \right)} \sum_{i=1}^N \left( \frac{m_i-\hat{m}}{\sigma} \right)^4 - \frac{3\left( N-1 \right)^2}{\left( N-2 \right) \left( N-3 \right)} $$

For a normal distribution, the small kurtosis should be zero:

In [124]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['SmallKurtosis' ] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'SmallKurtosis': -0.050955378480892044}

Skewness

The skewness of a sample is defined as follow: $$ Skewness = \frac{N}{\left(N-1\right)\left(N-2\right)} \sum_{i=1}^N \left( \frac{m_i-\hat{m}}{\sigma}\right)^3 $$

For a normal distribution it should be equal to zero:

In [118]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['Skew' ] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Skew': -0.022071078653900532}

Median absolute deviation

The median absolute deviation is defined as the median discrepancy of the data from the median data:

$$Median Absolute Deviation = median\left( |mag - median(mag)|\right) $$

It should take a value close to 0.675 for a normal distribution:

In [60]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['MedianAbsDev' ] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'MedianAbsDev': 0.67997348441562844}

Amplitude

The amplitude is defined as the half of the difference between the mean of the maximum 5% and the mean of the minimum 5% magnitudes. For a sequence of numbers from 0 to 1000 the amplitude should be equal to 475.5:

In [61]:
data2 = range(1000)
a = FeatureSpace(featureList=['Amplitude' ] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Amplitude': 475.0}

Con

Index introduced for the selection of variable stars from the OGLE database (Wozniak 2000). To calculate Con, we count the number of three consecutive data points that are brighter or fainter than $2\sigma$ and normalize the number by $N-2$.

For a normal distribution and by considering just one star, Con should take values close to 0.045:

In [62]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['Con' ] , Con=1)
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Con': 0.0451}

Anderson-Darling test

The Anderson-Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. When applied to testing if a normal distribution adequately describes a set of data, it is one of the most poweful statistical tools for detecting most departures from normality.

For a normal distribution the Anderson-Darling statistic should take values close to 0.25:

In [63]:
b=[]
for i in xrange(5000):
    data2 = np.random.normal(size=10000)
    a = FeatureSpace(featureList=['AndersonDarling' ] )
    a=a.calculateFeature(data2)
    b.extend(a.result())
    
fig = plt.hist(b)

Linear trend

Slope of a linear fit to the light-curve.

In [64]:
data2 = np.random.normal(size=10000)
mjd2=np.arange(10000)
a = FeatureSpace(featureList=['LinearTrend'] ,LinearTrend = mjd2  )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'LinearTrend': -4.240657029212982e-06}

Max slope

Maximum absolute magnitude slope between two consecutive observations

In [65]:
data2 = np.random.normal(size=1000)
mjd2=np.arange(1000)
a = FeatureSpace(featureList=['MaxSlope'] , MaxSlope = mjd2  )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'MaxSlope': 4.129386971213191}

Beyond 1 std

Percentage of points beyond one standard deviation from the weighted mean.

For a normal distribution, it should take a value close to 0.32:

In [66]:
data2 = np.random.normal(size=10000)
error2 = np.random.normal(loc=0.01, scale =0.01, size=10000)
a = FeatureSpace(featureList=['Beyond1Std'] , Beyond1Std = error2 )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Beyond1Std': 0.3131}

Pair slope trend

Considering the last 30 (time-sorted) measurements of source magnitude, the fraction of increasing first differences minus the fraction of decreasing first differences.

In [67]:
data2 = np.random.normal(size=100000)
a = FeatureSpace(featureList=['PairSlopeTrend'])
a=a.calculateFeature(data2)
print a.result(method='dict')
{'PairSlopeTrend': 0.23333333333333334}

Flux percentile ratio mid20, mid 35, mid 50, mid 65 and mid 80

In order to caracterize the sorted magnitudes distribution we use percentiles. If $F_{5,95}$ is the difference between $95\%$ and $5\%$ magnitude values, we calculate the following:

  • flux_percentile_ratio_mid20: ratio $F_{40,60}/F_{5,95}$
  • flux_percentile_ratio_mid35: ratio $F_{32.5,67.5}/F_{5,95}$
  • flux_percentile_ratio_mid50: ratio $F_{25,75}/F_{5,95}$
  • flux_percentile_ratio_mid65: ratio $F_{17.5,82.5}/F_{5,95}$
  • flux_percentile_ratio_mid80: ratio $F_{10,90}/F_{5,95}$

For the first feature for example, in the case of a normal distribution, this is equivalente to calculate $\frac{erf^{-1}(2 \cdot 0.6-1)-erf^{-1}(2 \cdot 0.4-1)}{erf^{-1}(2 \cdot 0.95-1)-erf^{-1}(2 \cdot 0.05-1)}$. So, the expected values for each of the flux percentile features are:

  • flux_percentile_ratio_mid20 = 0.154
  • flux_percentile_ratio_mid35 = 0.275
  • flux_percentile_ratio_mid50 = 0.410
  • flux_percentile_ratio_mid65 = 0.568
  • flux_percentile_ratio_mid80 = 0.779
In [68]:
data2 = np.random.normal(size=100000)
a = FeatureSpace(featureList=['FluxPercentileRatioMid20','FluxPercentileRatioMid35','FluxPercentileRatioMid50','FluxPercentileRatioMid65','FluxPercentileRatioMid80'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'FluxPercentileRatioMid80': 0.77702962908967843, 'FluxPercentileRatioMid50': 0.40982223215030678, 'FluxPercentileRatioMid20': 0.15272208696790535, 'FluxPercentileRatioMid35': 0.27452225669015012, 'FluxPercentileRatioMid65': 0.56863036355327123}

$Q_{3-1}$

$Q_{3-1}$ is the difference between the third quartile, $Q_3$, and the first quartile, $Q_1$, of a raw light curve. $Q_1$ is a split between the lowest $25\%$ and the highest $75\%$ of data. $Q_3$ is a split between the lowest $75\%$ and the highest $25\%$ of data.

In [112]:
data2 = np.random.normal(size=100000)
a = FeatureSpace(featureList=['Q31'])
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Q31': 1.3541143776684113}

$Q_{3-1|B-R}$

$Q_{3-1}$ applied to the difference between both bands of a light curve (B-R).

In [116]:
data2 = np.random.normal(size=10000)
data3 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['Q31B_R'] , Q31B_R = [data3, data2])
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Q31B_R': 1.8981866243944057}

Percent difference flux percentile

In [108]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['PercentDifferenceFluxPercentile'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'PercentDifferenceFluxPercentile': 211.28095967276019}

Percent amplitude

Largest percentage difference between either the max or min magnitude and the median.

In [103]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['PercentAmplitude'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'PercentAmplitude': 2010.2969651247706}

Mean variance $\frac{\sigma}{\bar{m}}$

This is a simple variability index and is defined as the ratio of the standard deviation, $\sigma$, to the mean magnitude, $\bar{m}$. If a light curve has strong variability, $\frac{\sigma}{\bar{m}}$ of the light curve is generally large.

For a uniform distribution the mean-variance should take a value close to 0.577:

In [73]:
data2 = np.random.uniform(size=1000000)
a = FeatureSpace(featureList=['Meanvariance'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Meanvariance': 0.57733385945571203}

Median buffer range percentage (MedianBRP)

Fraction of photometric points within amplitude/10 of the median magnitude.

In [74]:
data2 = np.random.normal(size=1000000)
a = FeatureSpace(featureList=['MedianBRP'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'MedianBRP': 0.658927}

CAR features

In order to model the irregular sampled times series we use CAR(1), a continious time auto regressive model. CAR(1) process has three parameters, it provides a natural and consistent way of estimating a characteristic time scale and variance of light-curves. CAR(1) process is described by the following stochastic differential equation:

$$ dX(t) = - \frac{1}{\tau} X(t)dt + \sigma_C \sqrt{dt} \epsilon(t) + bdt, $$ $$for \: \tau, \sigma_C, t \geq 0 $$

where the mean value of the lightcurve $X(t)$ is $b\tau$ and the variance is $\frac{\tau\sigma_C^2}{2}$. $\tau$ is the relaxation time of the process $X(T)$, it can be interpreted as describing the variability amplitude of the time series. $\sigma_C$ can be interpreted as describing the variability of the time series on time scales shorter than $\tau$. $\epsilon(t)$ is a white noise process with zero mean and variance equal to one. The likelihood function of a CAR(1) model for a light-curve with observations $x - \{x_1, \dots, x_n\}$ observed at times $\{t_1, \dots, t_n\}$ with measurements error variances $\{\delta_1^2, \dots, \delta_n^2\}$ is:

$$ p \left( x|b,\sigma_C,\tau \right) = \prod_{i=1}^n \frac{1}{[2 \pi \left( \Omega_i + \delta_i^2 \right)]^{1/2}} exp \{ -\frac{1}{2} \frac{\left( \hat{x}_i - x^*_i \right)^2}{\Omega_i + \delta^2_i} \} $$ $$ x_i^* = x_i - b\tau$$ $$ \hat{x}_0 = 0 $$ $$ \Omega_0 = \frac{\tau \sigma^2_C}{2} $$ $$ \hat{x}_i = a_i\hat{x}_{i-1} + \frac{a_i \Omega_{i-1}}{\Omega_{i-1} + \delta^2_{i-1}} \left(x^*_{i-1} + \hat{x}_{i-1} \right) $$ $$ \Omega_i = \Omega_0 \left( 1- a_i^2 \right) + a_i^2 \Omega_{i-1} \left(1 - \frac{\Omega_{i-1}}{\Omega_{i-1} + \delta^2_{i-1}} \right) $$ $$ a_i = e^{-\left(t_i-t_{i-1}\right)/\tau} $$

To find the optimal parameters we maximize the likelihood with respect to $\sigma_C$ and $\tau$ and calculate $b$ as the mean magnitude of the light-curve divided by $\tau$.

In [75]:
data2 = np.random.normal(scale=3, size=1000)
mjd2=np.arange(1000)
error2 = np.random.normal(loc=0.01, scale =0.8, size=1000)
a = FeatureSpace(featureList=['CAR_sigma', 'CAR_tau','CAR_tmean'] , CAR_sigma = [mjd2,error2] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'CAR_sigma': 15.365838862970492, 'CAR_tmean': -1.4920083729422831, 'CAR_tau': 0.071978245330063695}
In [76]:
print np.var(data2), (15.95)**2 * 0.071/2
9.00070490472 9.03128875

Unit test

In order to check systematically the validity of each feature, every test mentioned above is added to a unit test file called "test_library.py". This script should be always ran before using the library by executing $ py.test. Also, if the user writes a new feature for the library, a pertinent test should be added to the unit test file. The idea is to guarantee, as far as possible, that every feature calculated is correct. In most of the cases this can be reached by calculating the expected feature value for a known distribution (normal, uniform, etc) and then by checking with the value obtained from the library.

The following image shows how py.test output should look if all the tests are passed:

In [177]:
Image(filename='unit2.png')
Out[177]:

References

[1] Falk, M., Marohn, F., Michel, R., Hofmann, D., Macke, M., Tewes, B., ... & Englert, S. (2011). A First Course on Time Series Analysis: Examples with SAS. An open source book on time series analysis with SAS.

[2] Kim, D. W., Protopapas, P., Alcock, C., Byun, Y. I., & Bianco, F. (2009). De-Trending Time Series for Astronomical Variability Surveys. Monthly Notices of the Royal Astronomical Society, 397(1), 558-568. Doi:10.1111/j.1365-2966.2009.14967.x.

[3] Kim, D. W., Protopapas, P., Byun, Y. I., Alcock, C., Khardon, R., & Trichas, M. (2011). Quasi-stellar object selection algorithm using time variability and machine learning: Selection of 1620 quasi-stellar object candidates from MACHO Large Magellanic Cloud database. The Astrophysical Journal, 735(2), 68. Doi:10.1088/0004-637X/735/2/68.

[4] Kim, D. W., Protopapas, P., Bailer-Jones, C. A., Byun, Y. I., Chang, S. W., Marquette, J. B., & Shin, M. S. (2014). The EPOCH Project: I. Periodic Variable Stars in the EROS-2 LMC Database. arXiv preprint Doi:10.1051/0004-6361/201323252.

[5] Nun, I., Pichara, K., Protopapas, P., & Kim, D. W. (2014). Supervised Detection of Anomalous Light Curves in Massive Astronomical Catalogs. The Astrophysical Journal, 793(1), 23. Doi:10.1088/0004-637X/793/1/23.

[6] Pichara, K., Protopapas, P., Kim, D. W., Marquette, J. B., & Tisserand, P. (2012). An improved quasar detection method in EROS-2 and MACHO LMC data sets. Monthly Notices of the Royal Astronomical Society, 427(2), 1284-1297. Doi:10.1111/j.1365-2966.2012.22061.x.

[7] Richards, J. W., Starr, D. L., Butler, N. R., Bloom, J. S., Brewer, J. M., Crellin-Quick, A., ... & Rischard, M. (2011). On machine-learned classification of variable stars with sparse and noisy time-series data. The Astrophysical Journal, 733(1), 10. Doi:10.1088/0004-637X/733/1/10.

In [178]:
picture = Image(filename='peanuts.jpg')
picture.size = (30, 30)
picture
Out[178]:
In []:
 
s="p">(featureList=['PercentDifferenceFluxPercentile'] ) a=a.calculateFeature(data2) print a.result(method='dict')
{'PercentDifferenceFluxPercentile': 211.28095967276019}

Percent amplitude

Largest percentage difference between either the max or min magnitude and the median.

In [103]:
data2 = np.random.normal(size=10000)
a = FeatureSpace(featureList=['PercentAmplitude'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'PercentAmplitude': 2010.2969651247706}

Mean variance $\frac{\sigma}{\bar{m}}$

This is a simple variability index and is defined as the ratio of the standard deviation, $\sigma$, to the mean magnitude, $\bar{m}$. If a light curve has strong variability, $\frac{\sigma}{\bar{m}}$ of the light curve is generally large.

For a uniform distribution the mean-variance should take a value close to 0.577:

In [73]:
data2 = np.random.uniform(size=1000000)
a = FeatureSpace(featureList=['Meanvariance'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'Meanvariance': 0.57733385945571203}

Median buffer range percentage (MedianBRP)

Fraction of photometric points within amplitude/10 of the median magnitude.

In [74]:
data2 = np.random.normal(size=1000000)
a = FeatureSpace(featureList=['MedianBRP'] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'MedianBRP': 0.658927}

CAR features

In order to model the irregular sampled times series we use CAR(1), a continious time auto regressive model. CAR(1) process has three parameters, it provides a natural and consistent way of estimating a characteristic time scale and variance of light-curves. CAR(1) process is described by the following stochastic differential equation:

$$ dX(t) = - \frac{1}{\tau} X(t)dt + \sigma_C \sqrt{dt} \epsilon(t) + bdt, $$ $$for \: \tau, \sigma_C, t \geq 0 $$

where the mean value of the lightcurve $X(t)$ is $b\tau$ and the variance is $\frac{\tau\sigma_C^2}{2}$. $\tau$ is the relaxation time of the process $X(T)$, it can be interpreted as describing the variability amplitude of the time series. $\sigma_C$ can be interpreted as describing the variability of the time series on time scales shorter than $\tau$. $\epsilon(t)$ is a white noise process with zero mean and variance equal to one. The likelihood function of a CAR(1) model for a light-curve with observations $x - \{x_1, \dots, x_n\}$ observed at times $\{t_1, \dots, t_n\}$ with measurements error variances $\{\delta_1^2, \dots, \delta_n^2\}$ is:

$$ p \left( x|b,\sigma_C,\tau \right) = \prod_{i=1}^n \frac{1}{[2 \pi \left( \Omega_i + \delta_i^2 \right)]^{1/2}} exp \{ -\frac{1}{2} \frac{\left( \hat{x}_i - x^*_i \right)^2}{\Omega_i + \delta^2_i} \} $$ $$ x_i^* = x_i - b\tau$$ $$ \hat{x}_0 = 0 $$ $$ \Omega_0 = \frac{\tau \sigma^2_C}{2} $$ $$ \hat{x}_i = a_i\hat{x}_{i-1} + \frac{a_i \Omega_{i-1}}{\Omega_{i-1} + \delta^2_{i-1}} \left(x^*_{i-1} + \hat{x}_{i-1} \right) $$ $$ \Omega_i = \Omega_0 \left( 1- a_i^2 \right) + a_i^2 \Omega_{i-1} \left(1 - \frac{\Omega_{i-1}}{\Omega_{i-1} + \delta^2_{i-1}} \right) $$ $$ a_i = e^{-\left(t_i-t_{i-1}\right)/\tau} $$

To find the optimal parameters we maximize the likelihood with respect to $\sigma_C$ and $\tau$ and calculate $b$ as the mean magnitude of the light-curve divided by $\tau$.

In [75]:
data2 = np.random.normal(scale=3, size=1000)
mjd2=np.arange(1000)
error2 = np.random.normal(loc=0.01, scale =0.8, size=1000)
a = FeatureSpace(featureList=['CAR_sigma', 'CAR_tau','CAR_tmean'] , CAR_sigma = [mjd2,error2] )
a=a.calculateFeature(data2)
print a.result(method='dict')
{'CAR_sigma': 15.365838862970492, 'CAR_tmean': -1.4920083729422831, 'CAR_tau': 0.071978245330063695}
In [76]:
print np.var(data2), (15.95)**2 * 0.071/2
9.00070490472 9.03128875

Unit test

In order to check systematically the validity of each feature, every test mentioned above is added to a unit test file called "test_library.py". This script should be always ran before using the library by executing $ py.test. Also, if the user writes a new feature for the library, a pertinent test should be added to the unit test file. The idea is to guarantee, as far as possible, that every feature calculated is correct. In most of the cases this can be reached by calculating the expected feature value for a known distribution (normal, uniform, etc) and then by checking with the value obtained from the library.

The following image shows how py.test output should look if all the tests are passed:

In [177]:
Image(filename='unit2.png')
Out[177]:

References

[1] Falk, M., Marohn, F., Michel, R., Hofmann, D., Macke, M., Tewes, B., ... & Englert, S. (2011). A First Course on Time Series Analysis: Examples with SAS. An open source book on time series analysis with SAS.

[2] Kim, D. W., Protopapas, P., Alcock, C., Byun, Y. I., & Bianco, F. (2009). De-Trending Time Series for Astronomical Variability Surveys. Monthly Notices of the Royal Astronomical Society, 397(1), 558-568. Doi:10.1111/j.1365-2966.2009.14967.x.

[3] Kim, D. W., Protopapas, P., Byun, Y. I., Alcock, C., Khardon, R., & Trichas, M. (2011). Quasi-stellar object selection algorithm using time variability and machine learning: Selection of 1620 quasi-stellar object candidates from MACHO Large Magellanic Cloud database. The Astrophysical Journal, 735(2), 68. Doi:10.1088/0004-637X/735/2/68.

[4] Kim, D. W., Protopapas, P., Bailer-Jones, C. A., Byun, Y. I., Chang, S. W., Marquette, J. B., & Shin, M. S. (2014). The EPOCH Project: I. Periodic Variable Stars in the EROS-2 LMC Database. arXiv preprint Doi:10.1051/0004-6361/201323252.

[5] Pichara, K., Protopapas, P., Kim, D. W., Marquette, J. B., & Tisserand, P. (2012). An improved quasar detection method in EROS-2 and MACHO LMC data sets. Monthly Notices of the Royal Astronomical Society, 427(2), 1284-1297. Doi:10.1111/j.1365-2966.2012.22061.x.

[6] Richards, J. W., Starr, D. L., Butler, N. R., Bloom, J. S., Brewer, J. M., Crellin-Quick, A., ... & Rischard, M. (2011). On machine-learned classification of variable stars with sparse and noisy time-series data. The Astrophysical Journal, 733(1), 10. Doi:10.1088/0004-637X/733/1/10.

In [178]:
picture = Image(filename='peanuts.jpg')
picture.size = (30, 30)
picture
Out[178]:
In []:
 
dwXg044mdplJMl5QcbPaUveCq24YDbLaZhqH/AIBOU9Apq7+QJWqpWq/MNK+YYP8A GybKaMXE13Q172Wi+UYBZbZWqtnLEvAic/UVK1SzCjcEHI3UdcmvvUTT4XctzAWLuvZmO4CAz7XS eYrKo6BfAcHTDo9b3IOZb1GJYRVKaU6iIL9nY/oMZ+rRglrSj9RUE4CZangwOGLyQs/38w1HUKkL 3fPJ6QXW4XxNmwYMMW0fMAjRms1gcljXcgnKVtNWLZAXtELtFoXakGDdPvupdZ4tRWdtQGG8Gf1L DIbqX9EoQdgAgP1otjtkja2+V3+kw8/9BHF68l96MyzxLPEsmgD7wALMATiWezvARpJYkXJvurgC FnNvPqArrErT+occZFkO8phY3L1O6DJ8RLSUK4pknIvMubiss8jXwgEIcx99WQ7Zm3VM+RuNNscW noBCpc27IxPTANTSEBelFnn6QtALlm7szmA5fEs0LqEApTW8VcK7ipQ7t4g1psLns2k8mNB/SIq8 LeP/ALlidrK7XvUAo6qwnQGi/wDDZLsa5c/1BEAW8bPqFC7RAfBAcUh2DsTtM1LV1jZwj0AVTrfF 0fEyP28yfGvxBic5L6CR9arpGB5Jt/8A0yURQq58DYlKgv8Ab7tjwANUlndVq/Ec/RQBVRfYcwK6 CV5dhXPjOIvm4ocGKqZcyI1oNqS1/qAOmgqHnSO/V9oDqpVEzdgwKuzhNWcRiT0FhAmbX+JdQKDv xhDUdMvawqYEoKN4g3eErNF7S+CD3IAv4Oh0u7G/3TmJWolBOQYe5F4YCgPa+Fc5mbzXqLI2B4/O pe0TgZ/bMto5cPzcCKLYFV+dy5RisDXftBDnGrfhbjzipf8AmncWJsH4cEawTWr4EwQDFtfmBpqe EGCAcOub2wag5YaR9BSGOEyN6YoFsO4qazed6YgLQGChXxKomrAH3DK8SKDAX2nctQK/gxMBNThf 6IbHBJf7VWJcCGoA8P1FVyW6CvywTX0P+z/wf/ZVl+cJ+4qOUtTcrQcERHJgU9kvxqVMlUAFreC8 HiNgA1k/y9LiJzxd/wB7CWZbrDQAC99wl2hprfaLY8atqbLfMAuz2aGy7qjh7t4jU/wTUWugXsIc ZhdkGXCMrLpLcE5wYj7CZIzkQfMVYfhD7uf+VH/lR/5UAjnoUyL69YsH2iQq01ZV/c/z/wCk5K1U JgY0eEgYNPNUYQ0Uu0h9u8L53kR93P8AypDmtrbPszHgg5in15/uBNDBi98u4MBHDR+mGBV3SzGc 4O1Q/wDGh/48OCSAqe1MDsbLRr73XxKIdkGF13VnUG5vGvq5hBXJdrCbR21jB5lNV5rjfbDC2S6t njEDi81VshCHfvKuqtO7KJ7xKQoK81aVPMGgkDVVABoKhrpSUlJUo7Eo7SpRKJR2lSiUlJRKJUo6 VKlSpUqVKlSiV0qVKlSiUSpUqVKlFVxK61KlEolJRKdpSUlEo7SjtKO0QlJUqVKO0o7SiU7SjtKO 0olEo7SjtKO0o7SjtKO0o7SiUSpR2lEqUSiUSjtE6f/Z " >
In []: